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Abstract 

Recently, there has been much interest in 
spectral approaches to learning manifolds — 
so-called kernel eigenmap methods. These 
methods have had some successes, but their 
applicability is limited because they are not 
robust to noise. To address this limita- 
tion, we look at two-manifold problems, in 
which we simultaneously reconstruct two re- 
lated manifolds, each representing a different 
view of the same data. By solving these in- 
terconnected learning problems together and 
allowing information to flow between them, 
two-manifold algorithms are able to succeed 
where a non-integrated approach would fail: 
each view allows us to suppress noise in the 
other, reducing bias in the same way that 
an instrumental variable allows us to re- 
move bias in a linear dimensionality reduc- 
tion problem. We propose a class of al- 
gorithms for two-manifold problems, based 
on spectral decomposition of cross-covariance 
operators in Hilbert space. Finally, we dis- 
cuss situations where two-manifold problems 
are useful, and demonstrate that solving a 
two-manifold problem can aid in learning 
a nonlinear dynamical system from limited 
data. 

1 Introduction 

Manifold learning algorithms are non-linear meth- 
ods for embedding a set of data points into a low- 
dimensional space while preserving local geometry. 
Recently, there has been a great deal of interest in 
spectral approaches to learning manifolds. These ker- 
nel eigenmap methods include Isomap [1 , Locally 
Linear Embedding (LLE) [2], Laplacian Eigenmaps 
(LE) [3], Maximum Variance Unfolding (MVU) @], 
and Maximum Entropy Unfolding (MEU) [5j. These 
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approaches can be viewed as kernel principal compo- 
nent analysis [3] with specific choices of manifold ker- 
nels [7]: they seek a small set of latent variables that, 
through a nonlinear mapping, explains the observed 
high-dimensional data. 

Despite the popularity of kernel eigenmap methods, 
they are limited in one important respect: they gen- 
erally only perform well when there is little or no 
noise. Several authors have attacked this problem 
using methods including neighborhood smoothing [8] 
and robust principal components analysis [9[ [TU] , with 
some success under limited noise. Unfortunately, the 
problem is fundamentally ill posed without some sort 
of side information about the true underlying signal: 
by design, manifold methods will recover extra latent 
dimensions which "explain" the noise. 

We take a different approach to the problem of learning 
manifolds from noisy observations. We assume access 
to a set of instrumental variables: variables that are 
correlated with the true latent variables, but uncorre- 
cted with the noise in observations. Such instrumen- 
tal variables can be used to separate signal from noise, 
as described in Section [3] Instrumental variables have 
been used to allow consistent estimation of model pa- 
rameters in many statistical learning problems, includ- 
ing linear regression [11] . principal component analy- 
sis [H] , and temporal difference learning [121 . Here we 
extend the scope of this technique to manifold learn- 
ing. We will pay particular attention to the case of 
observations from two manifolds, each of which can 
serve as instruments for the other. We call such prob- 
lems two- manifold problems. 

In one-manifold problems, when the noise variance is 
significant compared to the manifold's geometry, ker- 
nel eigenmap methods are biased: they will fail to re- 
cover the true manifold, even in the limit of infinite 
data. Two-manifold methods, by contrast, can succeed 
in this case: we propose algorithms based on spectral 
decompositions related to cross-covariance operators 
in a tensor product space of two different manifold 
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kernels, and show that the instrumental variable idea 
suppresses noise in practice. We also examine some 
theoretical properties of two-manifold methods: we ar- 
gue consistency for a special case in Sec. |3.2.3[ but we 
leave a full theoretical analysis of these algorithms to 
future work. 

The benefits of the spectral approach to two-manifold 
problems are significant: first, the spectral decomposi- 
tion naturally solves the manifold alignment problem 
by finding low-dimensional manifolds (defined by the 
left and right eigenvectors) that explain both sets of 
observations. Second, the connection between man- 
ifolds and covariance operators opens the door to 
solving more-sophisticated machine learning problems 
with manifolds, including nonlinear analogs of canon- 
ical correlations analysis (CCA) and reduced-rank re- 
gression (RRR). 

As an example of this last point, subspace identifi- 
cation approaches to learning non-linear dynamical 
systems depend critically on instrumental variables 
and the spectral decomposition of (potentially infinite- 
dimensional) covariance operators [HI [13 E3 H7J HH] . 
Two-manifold problems are a natural fit: by relat- 
ing the spectral decomposition to our two-manifold 
method, subspace identification techniques can be 
forced to identify a manifold state space, and con- 
sequently, to learn a dynamical system that is both 
accurate and interpretable, outperforming the current 
state of the art. 

2 Preliminaries 

We begin by looking at two well-known classes of non- 
linear dimensionality reduction: kernel principal com- 
ponent analysis (kernel PCA) and manifold learning. 

2.1 Kernel PCA 

Kernel PCA [3] is a generalization of principal com- 
ponent analysis |12j : we first map our d-dimensional 
inputs xi,...,x n € R d to a higher-dimensional fea- 
ture space T using a feature mapping <j> : R d — > J 7 , 
and then find the principal components in this new 
space. If the features are sufficiently expressive, ker- 
nel PCA can find structure that regular PCA misses. 
However, if T is high- or infinite-dimensional, the 
straightforward approach to PCA, via an eigende- 
composition of a covariance matrix, is in general in- 
tractable. Kernel PCA overcomes this problem by as- 
suming that J 7 is a reproducing-kernel Hilbert space 
(RKHS), and that the feature mapping <j> is implicitly 
defined via an efficiently-computable kernel function 
K(x,x') = (<fi(x), 0(x'))jf. Popular kernels include the 
linear kernel K(x,x') = x-x' (which identifies the fea- 



ture space with the input space) and the RBF kernel 
tf(x,x') = exp(- 7 ||x-x'|| 2 /2). 

Conceptually, if we define an infinitely-tall "matrix" 
with columns </>(x,), 4> = (<f>(xi), . . . , </>(x„)), our goal 
is to recover the eigenvalues and eigenvectors of the 
centered covariance operator "T^xx = ^;^H$ T , where 
H is the centering matrix H = I„ — ^-11 T . To avoid 
working with the high- or infinite-dimensional covari- 
ance operator, we instead define the Gram matrix 
G = i^ ,T $. The nonzero eigenvalues of the centered 
covariance S^i ar e the same as those of the centered 
Gram matrix HGH. And, the corresponding unit- 

— 1/2 

length eigenvectors of T,xx are given by ^Hv i X i , 
where Xi and are the eigenvalues and eigenvectors 
of HGH [6 . If data weights P (a diagonal matrix) 
are present, we can instead decompose the weighted 
centered Gram matrix PHGHP. (Note that, perhaps 
confusingly, we center first and multiply by the weights 
only after centering; the reason for this order will be- 
come clear below, in Section [2~2| ) 

2.2 Manifold Learning 

Spectral algorithms for manifold learning, sometimes 
called kernel eigenmap methods, include Isomap [T], 
Locally Linear Embedding (LLE) [5] , Laplacian Eigen- 
maps (LE) [3], and Maximum Variance Unfolding 
(MVU) [4]. These methods seek a nonlinear function 
that maps a high-dimensional set of data points to 
a lower-dimensional space while preserving the mani- 
fold on which the data lies. The main insight behind 
these methods is that large distances in input space are 
often meaningless due to the large-scale curvature of 
the manifold; so, ignoring these distances can lead to 
a significant improvement in dimensionality reduction 
by "unfolding" the manifold. 

Interestingly, these algorithms can be viewed as spe- 
cial cases of kernel PCA where the Gram matrix G is 
constructed over the finite domain of the training data 
in a particular way [7J. Specifically, kernel eigenmap 
methods first induce a neighborhood structure on the 
data, capturing a notion of local geometry via a graph, 
where nodes are data points and edges are neighbor- 
hood relations. Then they solve an eigenvalue or re- 
lated problem based on the graph to embed the data 
into a lower dimensional space while preserving the lo- 
cal relationships. Here we just describe LE; the other 
methods have a similar intuition, although the details 
and performance characteristics differ. 

In LE, neighborhoods are summarized by an adja- 
cency matrix W, computed by nearest neighbors: el- 
ement Wij is nonzero whenever the ith point is one 
of the nearest neighbors of the jth point, or vice 
versa. Non-zero weights are typically either set to 
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1 or computed according to a Gaussian RBF kernel: 
w it j = exp(-7||xj -x J || 2 /2). Now let S,^ = J2j w hjy 
and set C = S^/^W - S)S" 1 / 2 . Finally, eigende- 
compose C to get a low dimensional embedding of the 
data points, discarding the top eigenvector (which is 
trivial): if C = VAV T , the embedding is S 1 / 2 v 2:fc+1 . 
To relate LE to kernel PCA, note that G = W - S 
is already centered, i.e., G = HGH. So, we can view 
LE as performing weighted kernel PC A, w ith Gram 
matrix G and data weights P = S~ 1/2 []^ 

The weighted Gram matrix C can be related to a ran- 
dom walk on the graph defined by W: if we add 
the identity and make a similarity transform (both 
operations that preserve the eigensystem), we get 
S _1 / 2 (C + I)S 1/2 = S _1 W, a stochastic transition 
matrix for the walk. So, Laplacian eigenmaps can be 
viewed as trying to keep points close together when 
they are connected by many short paths in our graph. 

3 Bias and Instrumental Variables 

Kernel eigenmap methods are very good at dimension- 
ality reduction when the original data points sample 
a high-dimensional manifold relatively densely, and 
when the noise in each sample is small compared to 
the local curvature of the manifold (or when we don't 
care about recovering curvature on a scale smaller 
than the noise). In practice, however, observations 
are frequently noisy. Depending on the nature of the 
noise, manifold-learning algorithms applied to these 
datasets produce biased embeddings. See Figures 1-2, 
the "noisy Swiss rolls," for an example of how noise 
can bias manifold learning algorithms. 

To see why, we examine PCA, a special case of man- 
ifold learning methods, and look at why it produces 
biased embeddings in the presence of noise. We first 
show how overcome this problem in the linear case, and 

x The original algorithm actually looks for the small- 
est eigenvalues of — C, which is equivalent. The matrix 
G = S — W is the graph Laplacian for our neighborhood 
graph, and — C is the weighted graph Laplacian — hence the 
algorithm name "Laplacian eigenmaps." 

2 The centering step is vestigial: we can include it, 
C = S- 1/2 H(W-S)HS -1/2 , but it has no effect. The 
discarded eigenvector (which corresponds to eigenvalue 
and to a constant coordinate in the embedding) similarly 
is vestigial: we can discard it or not, and it doesn't af- 
fect pairwise distances among embedded points. Interest- 
ingly, previous papers on the connection between Laplacian 
eigenmaps and the other algorithms mentioned here seem 
to contain an imprecision: they typically connect dropping 
the top (trivial) eigenvector with the centering step — e.g., 
see [5]. 

3 A minor difference is that LE does not scale its em- 
bedding using the eigenvalues of G; the related diffusion 
maps algorithm [19] does. 



then use these same ideas to fix kernel PCA, a non- 
linear algorithm. Finally, in Sec. |4j we extend these 
ideas to fully general kernel eigenmap methods. 

3.1 Bias in Finite-Dimensional Linear Models 

Suppose that Xj is a noisy view of some underlying 
low-dimensional latent variable Zj : x, = Mz; + e, for 
a linear transformation M and i.i.d. zero-mean noise 
term a. Without loss of generality, we assume that Xj 
and z, are centered, and that Cov[zi] and M both have 
full column rank: any component of z, in the nullspace 
of M doesn't affect Xj. 

In this case, PCA on X will generally fail to recover Z: 
the expectation of £, X x = ^XX T is M Cov[z;] M T + 
Cov[ej], while we need M Cov[zj] M T to be able to 
recover a transformation of M or Z. The unwanted 
term Cov[ei] will, in general, affect all eigenvalues and 
eigenvectors of J^xx, causing us to recover a biased 
answer even in the limit of infinite data. 

3.1.1 Instrumental Variables 

We can fix this problem for linear embeddings: in- 
stead of plain PCA, we can use what might be called 
two-subspace PCA. This method finds a statistically 
consistent solution through the use of an instrumental 
variable I12j , an observation y; that is correlated 
with the true latent variables, but uncorrelated with 
the noise in Xj. 

Importantly, picking an instrumental variable is not 
merely a statistical aid, but rather a value judgement 
about the nature of the latent variable and the noise 
in the observations. In particular, we are defining the 
noise to be that part of the variability which is uncor- 
related with the instrumental variable, and the signal 
to be that part which is correlated. 

In our example above, a good instrumental variable 
Yi is a different (noisy) view of the same underlying 
low-dimensional latent variable: = Nz^ + Q for 
some full-column-rank linear transformation N and 
i.i.d. zero- mean noise term Q. The expectation of 
the empirical cross covariance Sxy = -XY T is then 
M Cov(z,j) N T : the noise terms, being independent 
and zero-mean, cancel out. (And the variance of each 
element of Sjcy goes to as n — > oo.) 

In this case, we can identify the embedding by com- 
puting the singular value decomposition (SVD) of the 
covariance: let UDV T = S^y, where U and V are 
orthonormal and D is diagonal. To reduce noise, we 
can keep just the columns of U, D, and V which cor- 
respond to the top k largest singular values (diagonal 
entries of D): (U, D, V) = SVD(S xy , k). If we set k 
to be the true dimension of z, then as n — > oo, U will 
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converge to an orthonormal basis for the range of M, 
and V will converge to an orthonormal basis for the 
range of N. The corresponding embeddings are then 
given by U T X and V T Y. 

Interestingly, we can equally well view Xj as an instru- 
mental variable for y,: we simultaneously find con- 
sistent embeddings of both x< and y^, using each to 
unbias the other. 

3.1.2 Whitening: Reduced-Rank Regression 
and Cannonical Correlation Analysis 

Going beyond two-subspace PCA, there are a num- 
ber of interesting spectral decompositions of cross- 
covariance matrices that involve transforming the vari- 
ables Xj and y, before applying a singular value decom- 
position 12 . For example, in reduced-rank regres- 
sion [201 [TJ] , we want to estimate E [xj | y*] . Define 
£ YY = I YY T and t XY = -XY T ■ Then the ordi- 
nary regression of Y on X is S^y(Syy + ?7l) Y, 
where the regularization term 77I (n > 0) ensures that 
the matrix inverse is well-defined. For a reduced-rank 
regression, we instead project onto the set of rank-/c 
matrices: (U, D, V) = SVD(t XY (t YY + ^I^Y, k). 
In our example above, so long as we let 77 — > as 
n —± 00, U will again converge to an orthonormal ba- 
sis for the range of M. 

To connect back to two-subspace PCA, we can show 
that RRR is the same as two-subspace PCA if we 
first whiten Y. That is, we can equivalently define 
U for RRR to be the top k left singular vectors of the 
covariance between X and the whitened instruments 
Y w = (t YY + r?ir 1/2 Y. (Here, A 1 ' 2 stands for the 
symmetric square root of A, which is guaranteed to 
exist and be invertible if A is symmetric and positive 
definite.) Whitening means transforming a covariance 
matrix toward the identity: if rj = and Syy has full 
rank, then EfY^YTj = SyfSyyS^ 2 = I, while if 
77 is near 0, then E[Y^Y^] w I. 



For symmetry, we can whiten both X and Y before 
computing their covariance. An SVD of the result- 
ing doubly-whitened cross-covariance matrix (Sxx + 
r/I) _1 / 2 Sxr(Syy + ?7l) -1 / 2 is called canonical corre- 
lation analysis [21] , and the resulting singular values 
are called canonical correlations. 

3.2 Bias in Learning Nonlinear Models 



We now extend the analysis of Section [XT] to nonlinear 
models. We assume noisy observations x, = /(Zj) +e», 
where Zj is the desired low-dimensional latent variable, 
ej is an i.i.d. noise term, and / is a smooth function 
with smooth inverse (so that /(zi) lies on a manifold). 
Our goal is to recover / and z, up to identifiability. 



Kernel PCA (Sec. 2.1 1 is a common approach to this 
problem. In the realizable case, kernel PCA gets the 
right answer: that is, suppose that Zj has dimension k, 
and that we have at least k independent samples. And, 
suppose that 0(/(z)) is a linear function of z. Then, 
the Gram matrix or the covariance "matrix" will have 
rank k, and we can reconstruct a basis for the range of 
4>o f from the top k eigenvectors of the Gram matrix. 
(Similarly, if <f> o f is near linear and the variance of 
€i is small, we can expect kernel PCA to work well, if 
not perfectly.) 

However, just as PCA recovers a biased answer in 
the finite dimensional case when the variance of is 
nonzero, kernel PCA will also recover a biased answer 
in this case, even in the limit of infinite data. The bias 
of kernel PCA follows immediately from the example 
at the beginning of Section[3] if we use a linear kernel, 
kernel PCA will simply reproduce the bias of ordinary 
PCA. 

3.2.1 Instrumental Variables 

By analogy to two-subspace PCA, a a natural gener- 
alization of kernel PCA is two-subspace kernel PCA, 
which we can accomplish via a kernelized SVD of a 
cross-covariance operator in Hilbert space. Given a 
joint distribution F[X, Y] over two variables X on X 
and Y on y, with feature maps cj> and v (corresponding 
to kernels K x and K y ), the cross-covariance operator 
Sxy is E[0(x) Cg> v(y)}. The cross-covariance opera- 
tor reduces to an ordinary cross-covariance matrix in 
the finite-dimensional case; in the infinite-dimensional 
case, it can be viewed as a kernel mean map descrip- 
tor [22] for the joint distribution P[X, Y], 

The concept of a cross-covariance operator is helpful 
because it allows us to extend the methods of instru- 
mental variables to infinite dimensional RKHSs. In 
our example above, a good instrumental variable y, is 
a different (noisy) view of the same underlying low- 
dimensional latent variable: y^ = <?(z;) + Q for some 
smoothly invertible function g and i.i.d. zero-mean 
noise term Q. 

We proceed now to derive the kernel SVD for a c ross- 
covariance operator]^] We show below (Sec. 3.2.3) that 
the kernel SVD of the cross-covariance operator leads 
to a consistent estimate of the shared latent variables 
in two-manifold problems which satisfy appropriate as- 
sumptions. 



4 The kernel SVD algorithm previously appeared as an 
intermediate step in [171 |23| : here we generalize the al- 
gorithm to weighted cross covariance operators in Hilbert 
space and give a more complete description, both because 
the method is interesting in its own right, and because this 
generalized SVD will serve as a step in our two-manifold 
algorithms. 



Byron Boots, Geoffrey J. Gordon 



Conceptually, our inputs are "matrices" $ and Y 
whose columns are respectively 0(xj) and w(yi), 
along with data weights Px and Py. The cen- 
tered empirical covariance operator is then Sj^y = 
i($H)(TH) T . The goal of the kernel SVD is then to 
factor Exy (or possibly the weighted variant = 
i($H)PxPy(TH) T ) so that we can recover the de- 
sired bases for $>(x,) and v(yi). 

However, this conceptual algorithm is impractical, 
since S^y can be high- or infinite-dimensional. In- 
stead, we can perform an SVD on the covariance op- 
erator in Hilbert space via a trick analogous to ker- 
nel PCA. We first show how to perform an SVD 
on a covariance matrix using Gram matrices in finite- 
dimensional space, and then we extend the method to 
infinite dimensional spaces in Section |3~2~3) 

3.2.2 SVD via Gram matrices 

We start by looking at a Gram matrix formulation of 
finite dimensional SVD. In standard SVD, the singular 
values of Sxy = ^(XH)(YH) T are the square roots 

of the eigenvalues of Y^xy^yx (where Xyx = ^xy)> 
and the left singular vectors are defined to be the cor- 
responding eigenvectors. We can find identical eigen- 
vectors and eigenvalues through centered Gram matri- 
ces B x = i(XH) T (XH) and By = i(YH) T (YH). 
Let Vj be the right eigenvector of ByB^ so that 
ByBjv, = AjVj. Premultiplying by (XH) yields 

i(XH)(YH) T (YH)(XH) T (XH)v, = A 4 (XH) Vi 

and regrouping terms gives us SxySy^Wi = A^w^ 
where w, = (XH)Vj. So, A^ is an eigenvalue of 
SxySyx, vAi is a singular value of S^y, and 
(XH)viA i is the corresponding unit length left sin- 
gular vector. An analogous argument shows that, if 
w£ is a unit-length right singular vector of Sxy , then 
w£ = (YH)v-A,- ' , where is a unit-length left 
eigenvector of ByBx- Furthermore, we can easily in- 
corporate weights into SVD by using weighted matri- 
ces Cx = PjrBxPi and Cy = PyByPy in place of 
Bx and By. 

3.2.3 Two-subspace PCA in RKHSs 



The machinery developed in Section 3.2.2 allows 



us to solve the two-subspace kernel PCA prob- 
lem by computing the singular values of the 
weighted empirical covariance operator Y X y- We 
define Gx and G-y to be the Gram matrices 
whose elements are K x (x.i,x.j) and K y (yi,yj) re- 
spectively, and then compute the eigendecomposi- 
tion of C Y C X = (P y HGyHPy)(P x HG x HP x ). 
This method avoids any computations in infinite- 



dimensional spaces; and, it gives us compact repre- 
sentations of the left and right singular vectors. E.g., 
if Vj is a right eigenvector of CyCx, then the corre- 
sponding singular vector is w, ; = J^j v i,j v (' x -j — x )pJ> 
where is the weight assigned to data point Xj . 

Under appropriate assumptions, we can show that 
the SVD of the empirical cross-covariance operator 
Sxy = j$HT T converges to the desired value. Sup- 
pose that E[</>(xj) | zj is a linear function of Zj, 
and similarly, that E[u(yj) | Zj] is a linear function 
of Zij^j The noise terms (/>(xj) — E[</>(xj) | z<] and 
v(yi) — E[v(yi) | Zj] are by definition zero-mean; and 
they are independent of each other, since the first de- 
pends only on and the second only on Q. So, the 
noise terms cancel out, and the expectation of Sjy 
is the true covariance Sjy. If we additionally as- 
sume that the noise terms have finite variance, the 
product-RKHS norm of the error Y^xy — Sjy van- 
ishes as n — > oo. 

The remainder of the proof follows from the proof of 
Theorem 1 in [T7] (the convergence of the empirical es- 
timator of the kernel covariance operator). In partic- 
ular, the top k left singular vectors of Sxy converge 
to a basis for the range of E[^(xj) | Zj] (considered 
as a function of Zj); similarly, the top right singular 
vectors of Sjy converge to a basis for the range of 
E[u(yO | *•]. 

3.2.4 Whitening and Kernel SVD 

Just as one can use kernel SVD to solve the two- 
subspace kernel PCA problem for high- or infinite- 
dimensional feature space, one can also compute the 
kernel versions of CCA and RRR. (Kernel CCA is a 
well-known algorithm, though our formulation here is 
different than in [53] , and kernel RRR is to our knowl- 
edge novel.) Again, these problems can be solved by 
pre-whitening the feature space before applying kernel 
SVD. 

To compute a RRR from centered covariates HY in 
Hilbert space to centered responses H3? in Hilbert 
space, we find the kernel SVD of the HY-whitened 
covariance matrix: first we define B x = HGxH 
and By = HGyH; next we compute BxBy(By + 
?7l) _1 By; and, finally, we perform kernel SVD by find- 
ing the singular value decomposition of BjBy(B y + 
r?I)- x By. 



5 The assumption of linearity is restrictive, but appears 
necessary: in order to learn a representation of a man- 
ifold using factorization-based methods, we need to pick 
a kernel which flattens out the manifold into a subspace. 
This is why kernel eigenmap methods are generally more 
successful than plain kernel PCA: by learning an appropri- 
ate kernel, they are able to adapt their nonlinearity to the 
shape of the target manifold. 
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Similarly, for CCA in Hilbert space, we find the kernel 
SVD of the HT- and H<I>-whitcned covariance matrix: 
B X {B X + r,I)- 1 B x B y (B|, + ryl^By. 

If data weights are present, we use instead weighted 
centered Gram matrices Cx = PxBxPx and Cy = 
PyByPy. Then, to generalize RRR and CCA to 
RKHSs, we can compute the eigendecompositions of 
Equations |la| -b, respectively: 

CxC^Cy + ^^Cy (la) 

C X (C 2 X + )] I)- 1 C X Cy(C y + )] I)- 1 Cy (lb) 

The consistency of both approaches follows directly 
from the consistency of kernel SVD, so long as we let 
rj -t as n -> oo P3] . 

4 Two-Manifold Problems 

Now that we have extended the instrumental variable 
idea to RKHSs, we can also expand the scope of mani- 
fold learning to two-manifold problems, where we want 
to simultaneously learn two manifolds for two covary- 
ing lists of observations, each corrupted by uncorre- 
cted noise The idea is simple: we view manifold 
learners as constructing Gram matrices, then apply 
the RKHS instrumental variable idea of Sec. |U As we 
will see below, this procedure allows us to regain good 
performance when observations are noisy. 

Suppose we are given two set of observations residing 
on (or near) two different manifolds: Xi, . . . , x„ £ ~R dl 
on a manifold Mx and yi, . . . , y„ € M. d2 on a manifold 
My. Further suppose that both Xj and y; are noisy 
functions of a latent variable Zj, itself residing on a 
latent fc-dimensional manifold Mz'- x i = f( z i) + e i 
and yi — g(zi) + Q. We assume that the functions / 
and g are smooth, so that /(Zj) and g(zi) trace out 
submanifolds f(Mz) Q Mx and g(Mz) Q My. We 
further assume that the noise terms e; and Q move Xj 
and yi within their respective manifolds Mx and My'- 
this assumption is without loss of generality, since we 
can can always increase the dimension of the manifolds 
Mx and My to allow an arbitrary noise term. See 
Figure [l] for an example. 

If the variance of the noise terms and Q is too high, 
or if Mx and My are higher-dimensional than the 
latent Mz manifold (i.e., if the noise terms move Xj 
and yt away from f(Mz) and g(Mz)), then it may be 

6 The uncorrelated noise assumption is extremely mild: 
if some latent variable causes correlated changes in our 
measurements on the two manifolds, then we are making 
the definition that it is part of the desired signal to be 
recovered. No other definition seems reasonable: if there 
is no difference in statistical behavior between signal and 
noise, then it is impossible to use a statistical method to 
separate signal from noise. 
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Figure 1: The Noisy Swiss Rolls. We are given two 
sets of 3-d observations residing on two different man- 
ifolds Mx and My. The latent signal Mz is 2-d, but 
Mx and My are each corrupted by 3-d noise. (A) 
5000 data points sampled from Mz- (B) The func- 
tions / {Mz) and g {Mz) "roll" the manifold in 3- 
dimensional space, two different ways, generating two 
different manifolds in observation space. (C) Each set 
of observations is then perturbed by 3-d noise (show- 
ing 2 dimensions only), resulting in 3-d manifolds Mx 
and My. The black lines indicate the location of the 
submanifolds from (B). 



difficult to reconstruct f{Mz) or g(Mz) separately 
from Xi or y,-. Our goal, therefore, is to use Xj and 
yi together to reconstruct both manifolds simultane- 
ously: the extra information from the correspondence 
between x^ and y^ will make up for noise, allowing suc- 
cess in the two-manifold problem where the individual 
one-manifold problems are intractable. 

Given samples of n i.i.d. pairs {xi,yi}™ =1 from two 
manifolds, we propose a two-step spectral learning al- 
gorithm for two-manifold problems: first, use cither 
a given kernel or an ordinary one-manifold algorithm 
such as LE or LLE to compute weighted centered 
Gram matrices Cx and Cy from x^ and y^ separately. 
Second, use use one of the cross-covariance methods 
from Section |3. 2. 4| such as kernel SVD, to recover the 
embedding of points in Mz- The procedure, called in- 
strumental eigenmaps, is summarized in Algorithm [TJ 

For example, if we are using LE (Section |2.2[ ), then 
C x = S x 1/2 {W x Sx)S x 1/2 and C Y = S^Wy- 
Sy)Sy , where Wj and Sx are computed from Xj, 
and Wy and Sy are computed from y,. We then 
perform a singular value decomposition and truncate 
to the top k singular values: 

(U,A,V T > =SVD(C x C y ,A:) 
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Algorithm 1 Instrumental Eigenmaps 
In: n i.i.d. pairs of observations {xj,y,-}" =1 
Out: embeddings Ex and Ey 

1: Compute centered Gram matrices: Bx and By 
and data weights Px and Py from xi : „ and yi :n 

2: Compute weighted, centered Gram matrices: 
C x = P x B x Px and Cy = PyByPy 

3: Perform a singular value decomposition and trun- 
cate the top k singular values: 
(U,A,V T ) =SVD(C x Cy,fc) 

4: Find the embeddings from the singular values: 
Ex = P x /2 U 2: fc+iA2^ +1 and 
Ey = P y / 2 V 2:fe+ iA2^ 2 +1 



The embeddings for x, and y, are then given by 

1/2 TT .1/2 j C-1/2-IJ * 1/2 

s x u 2:fc+iA 2:fe+1 and t5 y V2:fe + lA 2 . fe+1 . 

Computing eigenvalues of Cx Cy instead of Cx or Cy 
alone will alter the eigensystem: it will promote direc- 
tions within each individual learned manifold that are 
useful for predicting coordinates on the other learned 
manifold, and demote directions that are not useful. 
As shown in Figure [2] this effect strengthens our abil- 
ity to recover relevant dimensions in the face of noise. 

5 Two-Manifold Detailed Example: 
Learning Dynamical Systems 

A longstanding goal in machine learning and robotics 
has been to learn accurate, economical models of dy- 
namical systems directly from observations. This task 
requires two related subtasks: 1) learning a low di- 
mensional state space, which is often known to lie 
on a manifold; and 2) learning the system dynamics. 
We propose tackling this problem by combining spec- 
tral learning algorithms for non-linear dynamical sys- 
tems [H [TH OH [TBI [23 [32] with two-manifold meth- 
ods. 

Any of the above-cited spectral learning algorithms 
can be combined with two-manifold methods. Here, 
we focus on one specific example: we show how to com- 
bine HSE-HMMs [17 , a powerful nonparametric ap- 
proach to modeling dynamical systems, with manifold 
learning. Specifically, we look at one important step 
in the HSE-HMM learning algorithm: the original ap- 
proach uses kernel SVD to discover a low-dimensional 
state space, and we propose replacing the kernel SVD 
with a two-manifold method. We demonstrate that the 
resulting manifold HSE-HMM can outperform stan- 
dard HSE-HMMs (and many other well-known meth- 
ods for learning dynamical systems) on a difficult real- 
world example: the manifold HSE-HMM accurately 
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Figure 2: Solving the Noisy Swiss Roll two-manifold 
problem (see Fig. [Tjfor setup). Top graphs show em- 
beddings of Xi, bottom graphs show embeddings of y^. 
(A) 2-dimensional embeddings found by LE. The best 
results were obtained by setting the number of near- 
est neighbors to 5. Due to the large amounts of noise, 
the separately learned embeddings do not accurately 
reflect the latent 2-dimensional manifold. (B) The em- 
beddings learned from the left and right eigenvectors of 
C^Cy closely match the original data sampled from 
the true manifold. By treating the learning problem 
as a two-manifold problem, noise disappears in expec- 
tation and the latent manifold is recovered accurately. 



discovers a curved low-dimensional manifold which 
contains the state space, while other methods discover 
only a (potentially much higher-dimensional) subspace 
which contains this manifold. 

5.1 Hilbert Space Embeddings of HMMs 

The key idea behind spectral learning of dynamical 
systems is that a good latent state is one that lets 
us predict the future. HSE-HMMs implement this 
idea by finding a low-dimensional embedding of the 
conditional probability distribution of sequences of fu- 
ture observations, and using the embedding coordi- 
nates as state. Song et al. [T7] suggest finding this 
low-dimensional state space as a subspace of an infi- 
nite dimensional RKHS. 

Intuitively, we might think that we could find the best 
state space by performing PCA or kernel PCA of se- 
quences of future observations. That is, we would sam- 
ple n sequences of future observations Xi , . . . , x„ £ M. dl 
from a dynamical system. (If our training data con- 
sists of a single long sequence of observations, we can 
collect our sample by picking n random time steps 
t%, . . . , t n . Each sample x^ is then a sequence of obser- 
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vations starting at time U.) We would then construct 
a Gram matrix Gx, whose element is K x (xj , ) . 
Finally, we would find the eigendecomposition of the 
centered Gram matrix = HGjH as in Section [2~l"l 
The resulting embedding coordinates would be tuned 
to predict future observations well, and so could be 
viewed as a good state space. 

However, the state space found by kernel PCA is bi- 
ased: it typically includes noise, information that can- 
not be predicted from past observations. We would 
like instead to find a low dimensional state space that 
embeds the probability distribution of possible futures 
conditioned on the past. In particular, we want to find 
a state space that is uncorrelated with the noise in the 
future observations. From a two-manifold perspective, 
we view features of the past as instrumental variables 
to unbias the future. 

Therefore, in addition to sampling sequences of future 
observations, we sample corresponding sequences of 
past observations y i , . . . , y„ € R d2 : sequence y^ ends 
at time U — 1. We then construct a Gram matrix Gy, 
whose (i,j) element is K y (yi,yj). From Gy we con- 
struct the centered Gram matrix By = HGyH. Fi- 
nally, we i dentify the state space using a kernel SVD as 
in Section |3.2.3| (U,A,V T ) = SVD(B x By,fc) (there 



are no data weights for HSE-HMMs). The left singular 



"vectors" (reconstructed from U as in Section 3.2.3) 



now identify a subspace in which the system evolves. 
From this subspace, we can proceed to identify the 
parameters of the dynamical system as in Song et 



5.2 Manifold HSE-HMMs 

In contrast with HSE-HMMs, we are interested in 
modeling a dynamical system whose state space lies 
on a low-dimensional manifold, even if this manifold 
is curved to occupy a higher-dimensional subspace (an 



example is given in Section 5.3 below). We want to 
use this additional knowledge to constrain the learn- 
ing algorithm and produce a more accurate model for 
a given amount of training data. 

To do so, we replace the kernel SVD by a two-manifold 
method. That is, we learn weighted centered Gram 
matrices Cx and Cy for the future and past obser- 
vations, using a manifold method like LE or LLE (see 
Section 2.2 ). Then we apply a SVD to C^Cy in order 



to recover the latent state space. 



7 Other approaches such as CCA and RRR have also 
been successfully used for finite-dimensional spectral learn- 
ing algorithms [18| . suggesting that kernel SVD could be 
replaced by kernel CCA or kernel RRR (Section 3.2.4|. 



5.3 Slotcar: A Real- World Dynamical System 

Here we look at the problem of tracking and predict- 
ing the position of a slotcar with attached inertial mea- 
surement unit (IMU) racing around a track. The setup 
consisted of a track and a miniature car (1:32 scale 
model) guided by a slot cut into the track. Figure^ A) 
shows setup. At a rate of 10Hz, we extracted the es- 
timated 3-D acceleration and angular velocity of the 
car. An overhead camera also supplied estimates of 
the 2-dimensional location of the car. 

We collected 3000 successive observations while the 
slot car circled the track controlled by a constant pol- 
icy (with varying speeds). The goal was to learn a 
dynamical model of the noisy IMU data, and, after 
filtering, to predict current and future 2-dimensional 
locations. We used the first 2000 data points as train- 
ing data, and held out the last 500 data points for 
testing the learned models. We trained four models, 
and evaluated these models based on prediction accu- 
racy, and, where appropriate, the learned latent state. 

First, we trained a 20-dimensional embedded HMM 
with the spectral algorithm of Song et al. [T7], using 
sequences of 150 consecutive observations. Following 
Song et al., we used Gaussian RBF kernels, setting the 
bandwidth parameter with the "median trick" , and 
using regularization parameter A = 10~ 4 . Second, we 
trained a similar 20-dimensional embedded HMM with 
LE kernels. The number of nearest neighbors was se- 
lected to be 50, and the other parameters were set to 
be identical to the first model. (So, the only differ- 
ence is that the first model performs a kernel SVD 
to find the subspace on which the dynamical system 
evolves, while the second model solves a two-manifold 
problem.) Third, for comparison's sake, we trained a 
20-dimensional Kalman filter using the N4SID algo- 
rithm with Hankel matrices of 150 time steps; and 
finally, we learned a 20-state discrete HMM (with 400 
levels of discretization for observations) using the EM 
algorithm. 

Before investigating the predictive accuracy of the 
learned dynamical systems, we looked at the learned 
state space of the first three models. These models 
differ mainly in their kernel: Gaussian RBF, learned 
manifold from Laplacian Eigenmaps, or linear. As a 
test, we tried to reconstruct the 2-dimensional loca- 
tions of the car from each of the three latent state 
spaces: the more accurate the learned state space, the 
better we expect to be able to reconstruct the loca- 
tions. Results are shown in Figure [3]jB); the learned 
manifold is the clear winner. 

Finally we examined the prediction accuracy of each 
model. We performed filtering for different extents 
t\ = 100, . . . , 350, then predicted the car location for 
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Figure 3: Slot car with inertial measurement unit (IMU). (A) The slot car platform: the car and IMU (top) and 
racetrack (bottom). (B) A comparison of training data embedded into the state space of three different learned 
models. Red line indicates true 2-d position of the car over time, blue lines indicate the prediction from state 
space. The top graph shows the Kalman filter state space (linear kernel), the middle graph shows the HSE- 
HMM state space (Gaussian RBF kernel), and the bottom graph shows the manifold HSE-HMM state space 
(LE kernel). The LE kernel finds the best representation of the true manifold. (C) Root mean squared error for 
prediction (averaged over 250 trials) with different estimated models. The HSE-HMM significantly outperforms 
the other learned models by taking advantage of the fact that the data we want to predict lies on a manifold. 



a further £2 steps in the future, for £2 = 1, • • • , 100. 
The root mean squared error of this prediction in the 
2-dimensional location space is plotted in Figure [3](C). 
The Manifold HMM learned by the method detailed 
in Section |5.2| consistently yields lower prediction er- 
ror for the duration of the prediction horizon. This 
is important: not only does the manifold HSE-HMM 
provide a model that better visualizes the data that we 
want to predict, but it is significantly more accurate 
when filtering and predicting than classic and state-of- 
the-art alternatives. 

6 Related Work 

While preparing this manuscript, we learned of the 
simultaneous and independent work of Mahadevan et 
al. |27j . This paper defines one particular two-manifold 
algorithm, maximum covariance unfolding (MCU), by 
extending maximum variance unfolding; but, it does 
not discuss how to extend other one-manifold methods. 
It also does not discuss any asymptotic properties of 
the MCU method, such as consistency. 

A similar problem to the two-manifold problem is 
manifold alignment 28, 29 , which builds connections 
between two or more data sets by aligning their un- 
derlying manifolds. Generally, manifold alignment al- 
gorithms either first learn the manifolds separately 
and then attempt to align them based on their low- 
dimensional geometric properties, or they take the 
union of several manifolds and attempt to learn a 



latent space that preserves the geometry of all of 
them |29j . Our aim is different: we assume paired 
data, where manifold alignments do not; and, we focus 
on learning algorithms that simultaneously discover 
manifold structure (as kernel eigenmap methods do) 
and connections between manifolds (as provided by, 
e.g., a top-level learning problem defined between two 
manifolds) . 

This type of interconnected learning problem has been 
explored before in a different context via reduced-rank 
regression (RRR) HU HO] and sufficient dimension 
reduction (SDR) [32j [33l [34] . RRR is a linear regres- 
sion method that attempts to estimate a set of coeffi- 
cients (3 to predict response vectors y^ from covariate 
vectors Xj under the constraint that j3 is low rank (and 
can therefore be factored). In SDR, the goal is to find 
a linear subspace of covariates that makes response 
vectors y^ conditionally independent of the x^s. The 
formulation is in terms of conditional independence, 
and, unlike in RRR, no assumption is made on the 
form of the regression from to y^. Unfortunately, 
SDR pays a price for the relaxation of the linear as- 
sumption: the solution to SDR problems usually re- 
quires a difficult non-linear non-convex optimization. 

Manifold kernel dimension reduction [35] . finds an 
embedding of covariates x^ using a kernel eigenmap 
method, and then attempts to find a linear transfor- 
mation of some of the dimensions of the embedded 
points to predict response variables y^. The response 
variables are constrained to be linear in the manifold, 
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so the problem is quite different from a two-manifold 
problem. 

There has also been some work on finding a mapping 
between manifolds |36j and learning a dynamical sys- 
tem on a manifold [37] ; however, in both of these cases 
it was assumed that the manifold was known. 

Finally, some authors have focused on the problem of 
combining manifold learning algorithms with system 
identification algorithms. For example, Lewandowski 
et al. |38j introduces Laplacian eigenmaps that accom- 
modate time series data by picking neighbors based 
on temporal ordering. Lin et al. [39] and Li et al. [40] 
propose learning a piecewise linear model that approx- 
imates a non-linear manifold and then attempt to learn 
the dynamics in the low-dimensional space. Although 
these methods do a qualitatively reasonable job of con- 
structing a manifold, none of these papers compares 
the predictive accuracy of its model to state-of-the-art 
dynamical system identification algorithms. 

7 Conclusion 

In this paper we propose a class of problems called two- 
manifold problems, where two sets of corresponding 
data points, generated from a single latent manifold 
and corrupted by noise, lie on or near two different 
higher dimensional manifolds. We design algorithms 
by relating two-manifold problems to cross-covariance 
operators in RKHSs, and show that these algorithms 
result in a significant improvement over standard man- 
ifold learning approaches in the presence of noise. This 
is an appealing result: manifold learning algorithms 
typically assume that observations are (close to) noise- 
less, an assumption that is rarely satisfied in practice. 

Furthermore, we demonstrate the utility of two- 
manifold problems by extending a recent dynamical 
system identification algorithm to learn a system with 
a state space that lies on a manifold. The resulting al- 
gorithm learns a model that outperforms the current 
state-of-the-art in terms of predictive accuracy. To our 
knowledge this is the first combination of system iden- 
tification and manifold learning that accurately iden- 
tifies a latent time series manifold and is competitive 
with the best system identification algorithms at learn- 
ing accurate predictive models. 
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